A deep dive into advanced evaluation for data scientists, discussing why accuracy is often misleading and exploring alternative metrics for classification and regression, such as ROC-AUC, Log Loss, R², RMSLE, and Quantile Loss.
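A minimal sketch of the point the article makes, not taken from it: on an imbalanced dataset, accuracy can look strong while probability-based metrics such as ROC-AUC and log loss are more informative. The dataset and model here are illustrative stand-ins.

```python
# Illustrative only: a 95/5 imbalanced binary problem where accuracy alone
# would overstate how well the classifier separates the classes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # predicted P(class 1)

acc = accuracy_score(y_te, model.predict(X_te))
auc = roc_auc_score(y_te, proba)   # ranking quality, insensitive to threshold
ll = log_loss(y_te, proba)         # penalizes confident wrong probabilities
print(f"accuracy={acc:.3f}  roc_auc={auc:.3f}  log_loss={ll:.3f}")
```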
The article discusses using Large Language Model (LLM) embeddings as features in traditional machine learning models built with scikit-learn. It covers the process of generating embeddings from text data using models like Sentence Transformers, and how these embeddings can be combined with existing features to improve model performance. It details practical steps including loading data, creating embeddings, and integrating them into a scikit-learn pipeline for tasks like classification.
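The combination step the summary describes can be sketched as follows. The `embed` function below is a deterministic dummy standing in for a real encoder such as `SentenceTransformer("all-MiniLM-L6-v2").encode`; the texts, tabular features, and labels are invented for illustration.

```python
# Sketch: concatenate text embeddings with existing tabular features,
# then fit a standard scikit-learn pipeline on the combined matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def embed(texts, dim=32):
    # Stand-in for an LLM embedding model: pseudo-random but
    # reproducible per text within a run, shape (n_texts, dim).
    return np.stack([
        np.random.default_rng(abs(hash(t)) % 2**32).normal(size=dim)
        for t in texts
    ])

texts = ["great product", "terrible service", "okay experience", "loved it"]
rng = np.random.default_rng(0)
tabular = rng.normal(size=(len(texts), 3))  # pre-existing numeric features
y = np.array([1, 0, 0, 1])

X = np.hstack([embed(texts), tabular])      # embeddings + features side by side
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
print(X.shape)
```

Swapping the dummy `embed` for a Sentence Transformers encoder leaves the rest of the pipeline unchanged, which is the main appeal of this approach.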
This page details the topic namers available in Turftopic, allowing automated assignment of human-readable names to topics. It covers Large Language Models (local and OpenAI), N-gram patterns, and provides API references for the `TopicNamer`, `LLMTopicNamer`, `OpenAITopicNamer`, and `NgramTopicNamer` classes.
Python tutorial for reproducible labeling of cutting-edge topic models with GPT-4o mini. The article details training a FASTopic model and labeling its results using GPT-4o mini, emphasizing reproducibility and control over the labeling process.
Multi-class zero-shot embedding classification and error checking. This project improves zero-shot image/text classification using a novel dimensionality reduction technique and pairwise comparison, resulting in increased agreement between text and image classifications.
This article demonstrates how to use the attention mechanism in a time series classification framework, specifically for classifying normal sine waves versus 'modified' (flattened) sine waves. It details the data generation, model implementation (using a bidirectional LSTM with attention), and results, achieving high accuracy.
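The core of the approach described above can be sketched without a deep learning framework: attention pooling turns a sequence of per-timestep hidden states (the bidirectional LSTM's outputs in the article) into a single fixed-size vector. The array shapes and the random scoring vector here are illustrative assumptions.

```python
# NumPy sketch of attention pooling over time steps. In the article this
# sits on top of bidirectional LSTM outputs; here `hidden` is random data.
import numpy as np

rng = np.random.default_rng(0)
T, H = 50, 16                       # time steps, hidden size
hidden = rng.normal(size=(T, H))    # stand-in for per-step LSTM outputs

w = rng.normal(size=H)              # scoring vector (learned in practice)
scores = hidden @ w                 # one relevance score per time step
weights = np.exp(scores - scores.max())
weights /= weights.sum()            # softmax: attention weights sum to 1

context = weights @ hidden          # weighted sum -> fixed-size representation
print(context.shape)
```

The `weights` vector is also what makes the model interpretable: it shows which time steps (e.g. the flattened region of a modified sine wave) drove the classification.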
This article discusses the use of variational autoencoders (VAEs) to generate synthetic data as a solution to the impending data scarcity for training large language models. It explores how synthetic data can address issues like imbalanced datasets, particularly using the UCI Adult dataset, by generating synthetic samples to balance the dataset and improve classification accuracy.
This article provides a comprehensive guide on the basics of BERT (Bidirectional Encoder Representations from Transformers) models. It covers the architecture, use cases, and practical implementations, helping readers understand how to leverage BERT for natural language processing tasks.
This article provides a hands-on guide to classifying human activity using sensor data and machine learning. It covers preparing data, creating a feature extraction pipeline using TSFresh, training a machine learning classifier with scikit-learn, and validating the model using the Data Studio.
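A hand-rolled sketch of that pipeline shape, with a few summary statistics standing in for the hundreds of features TSFresh's `extract_features` would compute automatically. The synthetic "sensor windows" and activity labels are invented for illustration.

```python
# Sketch: windowed sensor signals -> per-window features -> classifier.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
windows, labels = [], []
for i in range(100):
    active = i % 2                                        # 0 = rest, 1 = active
    sig = rng.normal(scale=1.0 + 2.0 * active, size=128)  # one sensor window
    windows.append(sig)
    labels.append(active)

# Feature extraction: simple per-window statistics (TSFresh stand-in).
feats = pd.DataFrame({
    "mean": [w.mean() for w in windows],
    "std": [w.std() for w in windows],
    "ptp": [np.ptp(w) for w in windows],
})

scores = cross_val_score(RandomForestClassifier(random_state=0),
                         feats, labels, cv=5)
print(round(scores.mean(), 2))
```

Cross-validation here plays the role the article assigns to model validation; the feature table has the same windows-as-rows layout TSFresh produces.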
A detailed guide on creating a text classification model with Hugging Face's transformer models, including setup, training, and evaluation steps.